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Abstract 


This  paper  reviews  some  interesting  recent  developments  in  the  field  of  imconstrained  optimization. 

First  we  discuss  some  recent  research  regarding  secant  (quasi-Newton)  methods.  This  includes  analysis 

that  has  led  to  an  improved  understanding  of  the  comparative  behavior  of  the  BFGS,  DFP,  and  other 

updates  in  the  Broyden  class,  as  well  as  computational  and  theoretical  work  that  has  led  to  a  revival  of 

interest  in  the  symmetric  rank  one  update.  Second  we  discuss  recent  research  in  methods  that  utilize 

second  derivatives.  We  describe  tensor  methods  for  unconstrained  optimization,  which  have  achieved 

considerable  gains  in  efficiency  by  augmenting  the  standard  quadratic  model  with  low  rank  third  and 

fourth  order  terms,  in  order  to  allow  the  model  to  interpolate  some  function  and  gradient  information 

from  previous  iterations.  Finally,  we  will  review  some  work  that  has  been  done  in  constnicting  general 

purpose  methods  for  solving  unconstrained  optimization  problems  on  parallel  computers.  This  research 

has  led  to  a  renewed  interest  in  various  ways  of  performing  the  linear  algebra  computations  in  secant 

/ 

methods,  and  to  new  algorithms  that  make  use  of  multiple  concurrent  function  evaluations.  i  -r. 


1.  Introduction 


This  paper  reviews  some  interesting  developments  in  the  solution  of  unconstrained  optimization 
problems  over  the  last  few  years.  The  unconstrained  optimization  problem  is 

given/ :/?"-»/?  ,  find  j:*  for  which /(;t.)^/(x )  forallx  6  D,  (1.1) 

where  D  is  an  open  neighboihood  containing  x* .  We  assume  that  the  fimction/  is  at  least  twice  continu¬ 
ously  differentiable,  even  though  the  analytic  derivatives  may  not  be  readily  available.  Our  orientation  is 
towards  problems  where  the  number  of  variables,  n ,  is  not  too  large,  say  n^lOO.  Even  if  n  is  not  large,  it 
is  often  the  case  in  practice  that  the  evaluation  of  f{x)  is  very  expensive,  so  it  is  important  that  algo¬ 
rithms  make  as  few  evaluations  of  /  (x)  and  its  derivatives  as  possible. 

In  Section  2  we  give  a  very  brief  and  superficial  summary  of  the  standard  methods  for  solving 
unconstrained  optimization  problems,  and  their  relative  advantages  and  disadvantages.  Readers  familiar 
with  unconstrained  optimization  may  wish  to  skip  this  section.  More  extensive  references  can  be  found 
in  various  books,  including  Fletcher  [1980],  GiU,  Murray,  and  Wright  [1981],  and  Dennis  and  Schnabel 
[1983]. 

Section  3  discusses  a  number  of  recent  interesting  research  developments  in  secant  methods  for 
unconstrained  optimization.  By  secant  methods,  we  mean  methods  that  update  an  approximation  to  the 
Hessian  matrix  at  each  iteration,  using  only  values  of  the  gradient  at  current  and  previous  iterates.  The 
developments  we  discuss  include  experiments  and  analysis  that  have  led  to  a  better  understanding  of  the 
comparative  behavior  of  the  BFGS,  DFP,  and  other  updates  in  the  Broyden  class,  as  well  as  new  theoreti¬ 
cal  and  experimental  research  related  to  the  symmetric  rank  one  update. 

In  Section  4  we  review  some  recent  research  into  second  derivative  methods,  methods  that  utilize 
the  analytic  or  finite  difference  value  of  the  Hessian  matrix  at  each  iteration.  We  concentrate  on  tensor 
methods,  a  new  class  of  methods  that  augment  the  standard  quadratic  model  with  low  rank  approxima¬ 
tions  to  higher  derivatives  at  each  iteration.  We  briefly  describe  these  methods  in  the  contexts  of  both 
nonlinear  equations  and  unconstrained  optimization. 

Finally,  Section  5  summarizes  some  research  in  the  fairly  new  field  of  parallel  methods  for  uncon¬ 
strained  optimization.  We  concentrate  on  parallel  secant  methods.  We  briefly  discuss  both  some  ways  to 
parallelize  the  linear  algebra  computations  in  these  methods,  and  some  approaches  that  make  use  of  mul¬ 
tiple  concurrent  function  evaluations. 
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2.  Background 

Algorithms  for  solving  the  unconstrained  optimization  problem  (1.1)  are  iterative.  The  basic 
hramevt^oik  of  an  iteration  of  the  methods  that  are  used  when  the  number  of  variables  is  not  too  large  is 
shown  in  Algorithm  2.1. 

At  the  highest  level,  two  aspects  of  Algorithm  2.1  require  elaboration.  The  first  is  the  method  for 
calculating  or  approximating  the  Hessian  matrix  V^f  (z+).  The  second  is  the  method  for  chosing  the  new 
iterate  x+.  In  this  section,  we  give  a  very  brief  description  of  the  main  alternatives  that  are  used  in  prac¬ 
tice,  and  how  they  compare. 

Methods  for  calculating  or  approximating  the  Hessian  matrix  can  be  divided  into  two  classes, 
second  derivative  methods  and  secant  methods.  In  second  derivative  methods,  one  calculatesthe  Hessian 
matrix  V^f  (x),  analytically  or  by  finite  differences,  at  each  iteration.  If  finite  differences  are  used,  this 
costs  either  n  additional  evaluations  of  the  gradient  V/  (x)  or  n^+  additional  evaluations  of  the  func¬ 
tion  /  (x )  at  each  iteration. 

In  secant  methods,  one  forms  an  approximation  H+  to  the  Hessian  V^/  (x+)  using  only  values  of  the 
gradieru  at  the  current  and  previous  iterates.  This  approximation  is  chosen  so  that  the  model  of  f(x) 
around  the  new  iterate  x+ , 


Algorithm  2.1  ••  Basic  Unconstrained  Optimization  Iteration 

given  current  iterate  Xc  ,  /  (Xc), 

gc  =  V/  (a:^)  or  finite  difference  approximation. 

He  =  (Xc )  or  finite  difference  approximation 
or  secant  approximation 

select  new  iterate  x+  by  a  line  search  or  trust  region  method 
(often  X^  =  Xe  -He'^gc) 

evaluate  g+  =  Vf  (a:+)  or  finite  difference  approximation  if  not  done  in  previous  step 
(f  (x+)  is  evaluated  in  previous  step) 

decide  whether  to  stop;  if  not 

calculate  //+  =  V^/  (j:^)  or  finite  difference  approximation 
or  secant  approximation 
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m(z++rf)=/(x+)  +  ^(x+)^d  -irlj^dTH^d,  (2.2) 

interpolates  not  only  the  function  and  gradient  values  at  ,  but  also  the  gradient  value  at  the  previous 
iterate  Xc.  This  interpolation  condition  is  satisfied  if  the  new  Hessian  approximation  satisfys  the 
secant  equation 

H^s^y  (2.3) 


where  j  =x.^-Xf  ,y  -g{x^)-g{Xe)  . 

The  most  commonly  used  Hessian  approximation  is  the  BFGS  update  (Broyden  [1970],  Hetcher  [1970], 
Goldfarb  [1970],  Shanno  [1970]) 


HcSS^H,  yyi- 
s^HcS  y^s 


(2.4) 


This  update  makes  a  symmetric,  rank  two  change  to  the  previous  Hessian  approximation  He ,  and  if  is 
positive  definite  and  s^y>0,  then  H+  is  positive  definite.  In  practice,  the  initial  approximation  is  sym¬ 
metric  and  positive  definite,  and  usually  s^y  is  greater  than  0,  otherwise  the  update  is  skipped.  Thus  each 
Hessian  approximation  in  a  BFGS  method  is  symmetric  and  positive  definite. 


A  brief  comparison  between  the  costs  and  theoretical  properties  of  second  derivative  and  secant 
methods  is  given  in  Table  2.5.  It  shows  three  main  differences  between  these  two  classes  of  methods. 
First,  second  derivative  methods  require  an  evaluation  of  the  Hessian  matrix  at  each  iteration,  while 
secant  methods  do  not.  Second,  second  derivative  methods  require  Oin^)  arithmetic  operations  at  each 
iteration,  because  they  must  faaorize  a  symmetric  matrix,  while  secant  methods  can  be  implemented  in 
0(n‘)  operations  at  each  iteration,  because  they  can  use  techniques  to  update  the  factorization  of  H^  into 
the  factorization  of  (see  e.g.  Gill,  Murray,  and  Wright  [1981]  or  Dennis  and  Sdinabel  [1983],  or  Sec¬ 
tion  5.1).  Third,  if  the  Hessian  matrix  at  the  solution  is  nonsingular,  the  evenmal  rate  of  convergence  of 
the  second  derivative  methods  is  quadratic,  while  secant  methods  such  as  the  BFGS  converge  at  a  slower 
superlinear  rate. 

Thus  secant  methods  cost  less  per  iteration  than  second  derivative  methods,  but  can  be  expected  to 
require  more  iterations.  Computational  experience  confirms  that  this  is  usuaUy  the  case.  In  experiments 
by  Schnabel,  Koontz,  and  Weiss  [1985]  on  a  set  of  standard  test  problems,  it  was  found  that  second 
derivative  methods  and  secant  methods  had  roughly  the  same  reliability  (ability  to  find  the  solution) ,  and 
that  the  number  of  iterations  required  by  secant  methods  was  usually  a  relatively  small  multiple  (less  than 


Table  2^  -  Comparison,  Second  Derivative  Methods  vs  Secant  Methods 


2nd  Derivative 

Secant 

Evaluations  f  (x) 

per  V/ (x) 

Iteration  V^f(x) 

Usually  1  or  2 

Usually  1 

1  1  0 

Arithmetic  Operations  per  Iteration 

(2-6) 

Storage 

n^/lor  n‘ 

Rate  of  Local  Convergence 

Quadratic 

-j)  times  the  number  of  iterations  required  by  second  derivative  methods.  This  implies  that  if  the  cost  of 
function  evaluation  is  dominant,  and  the  cost  of  an  analytic  or  finite  difference  Hessian  evaluation  is  at 
least  -y  times  the  cost  of  a  gradient  evaluation,  then  it  is  probably  preferable  to  use  secant  methods.  This 

is  the  case  when  the  Hessian  is  approximated  by  finite  differences.  If  function  evaluation  is  not  expen¬ 
sive,  or  if  the  cost  of  Hessian  evaluation  is  not  much  more  than  the  cost  of  gradient  evaluation,  then  either 
method  is  probably  satisfactory  with  second  derivative  methods  possibly  having  a  slight  advantage  in 
their  reliability.  It  is  our  understanding  that  other  computational  studies  have  come  to  similar  conclu¬ 
sions. 


The  other  aspect  of  Algorithm  2. 1  that  we  discuss  briefly  is  the  method  for  choosing  the  new  iterate 
Two  important  classes  of  methods,  line  search  methods  and  trust  region  methods,  are  used  in  prac¬ 
tice.  Roughly  speaking,  the  objective  of  either  type  of  method  is  that  close  to  the  solution,  the  new  iterate 
should  be  chosen  to  be  the  minimizer  of  the  model 

m(x,  +d)=f(.x,)-*-g{Xcfd  +  ^d^H,  d  (2.6) 

and  that  otherwise  the  new  iterate  x+  =  Xc  +d  should  be  some  value  for  which  f  (.x+)<f  (Xc ). 

In  a  standard  line  search  algorithm,  the  new  iterate  x+  is  chosen  to  be 

x^=Xc-K(ffc+Dcr^gc  (2.7) 


where  X,:  >0  is  the  step  length  and  (He  +D<;  )"*  gc  is  the  search  direction.  Here  the  diagonal  matrix  Dc  is 
0  if  He  is  safely  positive  definite,  and  is  a  non-negative  matrix  such  that  He  is  positive  definite  oth¬ 
erwise  (see  Gill,  Murray,  and  Wright  [1981],  Dennis  and  Schnabel  [1983]).  Thus  the  search  direction  is 
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always  a  descent  direction  for  /(x),  and  close  to  the  solution,  the  search  direction  is  the  Newton  direc¬ 
tion.  The  step  length  Xe  generally  is  chosen  so  that  two  conditions,  slightly  stronger  than  / (x+)  <f(Xc) 
and  s^y  >0,  are  satisfied,  and  so  that  =  1  is  used  if  it  is  acceptable.  It  has  been  shown  that  it  is  always 
possible  to  choose  to  satisfy  such  conditions,  and  that  many  such  strategies  for  choosing  the  step 
length  and  the  search  direction  cause  line  search  algorithms  to  be  both  globally  convergent,  and  to  retain 
fast  local  convergence. 

The  trust  region  approach  is  somewhat  differenL  Rather  than  first  choosing  a  search  direction  and 
then  a  step  length,  the  trust  region  algorithm  first  chooses  an  approximate  step  length  Ac,  and  then 
chooses  the  next  iterate  to  be  an  approximate  solution  to  the  problem 

mimmi2&f(Xc)+g(.Xc)^d+^d'’'Hcd  subjea to  II  d  II  .  (2.8) 

If  this  step  does  not  result  in  a  satisfactory  decrease  in  /  (x ),  the  trust  region  A^  is  reduced  and  problem 
(2.8)  is  solved  again.  If  is  satisfaaory,  the  trust  region  is  adjusted  for  the  next  iteration. 

The  solution  to  the  trust  region  problem  (2.8)  is  d=~Hc~^gc  if  He  is  positive  definite  and 
1 1  gg  1 1  ^  Ac ,  otherwise  it  is  a  pair  (d,  pig)  for  which 

iHe+ii}I)d=--gg  ,  (2.9) 

is  at  least  positive  semi-definite,  and  lid  II  =Ac  .  Since  this  step  is  expensive  to  calculate, 
trust  region  algonthms  usually  approximate  it  in  one  of  two  ways.  Either  they  calculate  a  d  which 
satisfies  (2.9)  lor  some  ^c  and  nas  length  approximately  equal  to  the  trust  radius,  or  they  restiia  the 
choice  of  d  to  a  two-dimensional  subspace,  as  in  the  "dogleg"  algorithms.  Many  strategies  for  approxi¬ 
mately  solving  the  trust  region  problem  (2.8),  and  for  adjusting  the  trust  radius  Ac,  have  been  shown  to 
lead  to  algorithms  that  are  globally  convergent  and  retain  fast  local  convergence.  For  more  information 
on  trust  regions  methods,  see  for  example  Dennis  and  Schnabel  [1983],  Mor6  and  Sorensen  [1983],  or 
Shultz.  Schnabel,  and  Byrd  [1985]. 

In  our  computational  experience  (see  e.g.,  Schnabel,  Kooniz,  and  Weiss  [1985]).  we  have  found  no 
systematic  differences  between  line  search  and  trust  region  algorithms.  This  is  true  both  in  the  case 
where  He  is  the  analytic  or  finite  difference  Hessian,  and  where  it  is  a  secant  approximation.  Sometimes, 
however,  there  are  substantial  differences  between  the  efficiency  of  line  search  and  trust  region  algo¬ 
rithms  on  specific  problems,  so  that  it  may  be  useful  to  have  both  options  available  in  software.  Line 
search  methods  may  enjoy  a  smaU  advantage  when  used  in  conjunction  with  secant  approximation.<:  to  the 
Hessian,  because  they  can  assure  that  the  necessary  and  sufficient  condition  for  a  positive  definite  secant 
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update,  >0,  is  satisfied  at  each  iteration,  while  mist  region  methods  cannot  do  this.  Conversely, 
trust  region  methods  may  enjoy  a  small  advantage  when  used  with  analytic  or  finite  difference  Hessians, 
because  they  seem  to  deal  more  directly  with  indefinite  Hessians. 


3.  Recent  Research  on  Secant  .Methods 

In  this  section,  we  review  some  interesting  recent  developments  concerning  secant  methods  for 
unconstrained  optimization.  These  developments  fall  into  two  categories.  First,  in  Section  3.1,  we  dis¬ 
cuss  several  research  contributions  that  have  helped  explain  the  differences  between  the  two  best  known 
secant  updates,  the  BFGS  and  the  DFP.  Some  of  this  work  has  also  considered  other  updates  in  the  Broy- 
den  class,  which  consists  of  all  linear  combinations  of  the  BFGS  and  DFP.  Secondly,  in  Section  3.2.  we 
discuss  some  research  that  has  caused  a  revival  of  interest  in  the  symmetric  rank  one  update. 


3.1  Understanding  the  difference  in  performance  between  the  BFGS,  DFP,  and  other  updates  in 
the  Broyden  class 


As  we  mentioned  in  Section  2.  secant  methods  are  commonly  used  to  solve  unconstrained  optimi¬ 
zation  problems  when  function  evaluation  is  expensive  and  analytic  values  of  the  Hessian  are  not  avail¬ 
able.  These  methods  update  an  approximation  He  to  V^/(Xc)  into  an  approximation  //*  to  {x^), 
using  values  of  the  gradient  at  and  jr+. 


The  most  commonly  used  secant  update  for  unconstrained  optimization  is  the  BFGS  update  (2.4). 
Among  its  important  properties  are  that  it  obeys  the  secant  equation  (2.3),  that  //+  differs  from  He  by  a 
symmetnc  matrix  of  rank  two.  and  that  is  positive  definite  as  long  as  is  positive  definite  and 
s^y  >0.  It  has  long  been  known  that  many  other  updates  have  these  same  algebraic  properties.  In  partic¬ 
ular,  the  DFP  update 


H  =H  \ 

*  ^  yTs 


yy^{y~HeS)^  s 
iy'^s)^ 


(3.1) 


(Davidon  [1959],  Fletcher  and  Powell  [1963]),  the  oldest  known  secant  update  for  unconstrained  optimi¬ 
zation.  is  another  rank  two  update  that  obeys  the  secant  equation  and  preserves  symmetry  and  positive 
definiteness  under  the  same  conditions  as  the  BFGS.  In  addition,  any  update  in  the  Broyden  class,  which 
consists  of  all  linear  combinations  of  the  BFGS  and  DFP 
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*  (1  (3.2) 

also  is  a  symmetric  rank  two  update  that  obeys  the  secant  equation.  Furthermore,  if  [0,1],  then  //+($) 
also  is  positive  definite  if  He  is  positive  definite  and  s^y>0.  All  the  updates  in  the  Broyden  class  also 
share  the  important  property  that  they  are  invariant  under  linear  transformations  of  the  variable  space  (see 
e.g.  Dennis  and  Schnabel  [1983]). 

For  some  time,  the  conventional  wisdom  has  been  that  the  BFGS  is  the  best  update  in  the  Broyden 
class  in  practice,  and  that  it  is  significantly  superior  to  the  DFP.  There  has  been  little  theoretical  analysis, 
however,  that  helps  explain  this  superiority.  Two  types  of  convergence  analysis  have  been  successfully 
applied  to  secant  methods.  One  type,  which  originated  with  the  work  of  Broyden.  Dennis,  and  Mord 
[1973],  proves  local  superlinear  convergence  for  a  direct  prediction  algorithm  (where  each  iterate  x*+i  = 
Xk  -  Hc^  V  f  (xt))  under  the  initial  assumptions  that  II  xq-x.  II  Se  and  II  H  Q-V^f  {xq)  II  <5  fore 
and  5  sufficiently  small.  Under  these  assumptions,  Broyden,  Dennis,  and  Mord  [1973]  proved  the  super- 
linear  convergence  of  the  iterates  produced  by  either  the  BFGS  or  the  DFP  update.  Stachurski  [1981]  and 
Griewank  and  Toint  [1982]  proved  analogous  results  for  any  $€  [0.1]  in  (3.2)  under  the  same  assump¬ 
tions.  Thus  this  convergence  theory  does  not  differentiate  at  all  between  the  BFGS,  DFP,  and  their  con¬ 
vex  combinatioiis. 


A  second  type  of  convergence  result  has  been  proven  by  Powell  [  1976].  Under  the  assumption  that 
/  (x )  is  convex,  he  establishes  both  global  and  local  superlinear  convergence  of  a  BFGS  algorithm  that 
uses  a  standard  line  search.  While  he  does  not  establish  this  result  for  the  DFP  or  any  other  update  in  the 
Broyden  class,  until  recently  there  was  no  clear  understanding  of  whether  or  not  the  result  could  be 
extended  to  other  updates.  Therefore  it  was  not  clear  whether  this  result  helped  distinguish  between  vari¬ 
ous  secant  updates. 


The  first  recent  contribution  that  we  discuss  towards  understanding  the  difference  between  the 
BFGS  and  other  updates  was  provided  by  Powell  [1986].  He  compares  the  behavior  of  the  BFGS  and  the 
DFP  when  solving  the  problem 


/  (x)  =x^  X.  n  -2, 


(3.3) 


under  the  assumption  that  the  iterates  are  chosen  by  xt*i  =x*  -  //*“'  V/  (x*).  (Note  that  due  to  the  scale 

x-i  ^ 

invariance  of  these  methods,  (3.4)  is  equivalent  to  / (x)=X]^  +  =  [ .)  Powell  shows  that  if  \  is 

A> 

much  greater  than  I .  and  if  a  pamculaiiy  difficult  staning  point. 
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(3.4) 


is  chosen,  then  the  BFGS  will  require  about  2.4  logioX  iterations  to  achieve  convergence,  whereas  the 
DFP  will  require  about  X  iterations.  If  X  is  much  less  than  1,  either  method  requires  at  most  about  10 
iterations. 


This  analysis  shows  a  potentially  huge  difference  between  the  performance  of  the  BFGS  and  the 
DFP  in  some  situations.  Table  3.5  summarizes  some  computations  performed  by  Powell  [1986]  to 
confirm  this  analysis.  The  first  two  lines  show  the  number  of  iterations  required  by  both  methods  to 
reduce  it  x  II  by  a  factor  of  Kf*.  using  xq  given  in  (3.4).  They  correspond  very  closely  to  the  estimates 
from  Powell's  analysis.  The  last  two  rows  show  the  number  of  iterations  required  to  reduce  1 1  x  1 1  by  a 


factor  of  ICf*  when  xo  = 


cos  40® 
sin  40" 


.  While  they  indicate  that  the  worst  case  is  somewhat  extreme,  they 


still  show  a  clear  superiority  of  the  BFGS. 


Thus  the  analysis  by  PoweU  [1986]  of  the  behavior  of  the  BFGS  and  DFP  on  the  problem  (3.3) 
gives  some  indication  of  a  fundamental  difference  between  these  two  updates.  A  second  recent  paper,  by 
Byrd,  Nocedal,  and  Yuan  [1987],  sheds  additional  light  on  the  subject. 

Byrd.  .Nocedal,  and  Yuan  extend  the  global  and  local  superlinear  convergence  of  Powell  [1976]  for 
the  BFGS  update  to  any  update  in  the  Broyden  class  (3.2)  with  $  e  [0,1),  that  is  any  convex  combination 


Table  3.5  --  Iterations  Required  for  1 1  x*  1 1  <  10~*  1 1  xq  1 1  on  Problem  (3.4) 

(from  Powell  [1986]) 


1  X  1  10* 

10^  1  lO^ 

10^  1  10^ 

Bad  BFGS 

8 

10 

15 

20 

1  -^<7 

DFP 

16 

107 

1006 

Average  BFGS 

6 

7 

7 

7 

DFP 

15 

19 

24 
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of  the  BFGS  and  the  DFP  except  for  the  DFP  itself.  The  techniques  they  use  to  prove  this  result  give 
some  interesting  insight  into  the  behavior  of  these  methods.  A  key  quantity  in  their  analysis  is  cos(6t) , 
where  6^  is  the  angle  between  the  step  direction  and  the  negative  gradient  direction  at  the  iteration.  If 
this  angle  is  less  than  any  a  <  90°  infinitely  often,  then  the  iterates  produced  by  any  standard  line  search 
method  will  be  globally  convergent  (Wolfe  [1969, 1971]).  Byrd,  Nocedal,  and  Yuan  show  that,  if  / (x)  is 
uniformly  convex  and  xt+i  =  x*  -  X*  V/  (x*),  then 


cos(e*)> 


c  •  Trace  (//*) 


(3.6) 


for  some  constant  c .  This  indicates  that  the  method  can  fail  to  have  the  desired  global  convergence  pro¬ 
perties  only  if  the  step  lengths  X*  become  arbitrarily  close  to  0,  or  if  the  trace  of  /f*  becomes  arbitrarily 
large.  Next  they  show  that  for  any  (|ie  [0.1],  the  ^ometric  mean  of  the  step  lengths  {X*  )  produced  by 
the  algorithm  is  bounded  below.  This  means  that  the  step  lengths  do  not  converge  to  0  for  any  e  [0,1], 
so  that  the  only  possible  impediment  to  convergence  for  any  such  update  is  Trace(//t).  Finally,  Byrd, 
Nocedal,  and  Yuan  show  that 


Trace  (//*^,)  <  Trace  (//*) - ,  +  r  (0* ,  X*  ,9*  ,//*)  (3.7) 

c  {cos 


where  :(^k  .X*  .0* ,  //*)  are  some  additional  terms  that  are  less  cracial  to  the  analysis.  Equation  (3.7)  indi¬ 
cates  that  if  the  method  takes  a  bad  step  (i.e.,  cos(0t)  close  to  0),  then  the  trace  of  //*+i  will  be 
significantly  less  than  the  trace  of  //*,  as  long  as  <{)<  1.  This  in  mm  can  be  used  to  show  that  for  any 
method  with  (be  [0,1],  there  cannot  be  too  many  bad  steps,  which  leads  to  both  global  and  local  super- 
linear  convergence. 


An  interesting  aspect  of  the  convergence  of  Byrd,  Nocedal.  and  Yuan  [1987]  is  that  it  shows  that 
secant  methods  with  <b<l  have  a  "self-correcting"  property  with  respect  to  Trace(//*)  that  becomes  less 
strong  as  (b  gets  closer  to  1,  and  is  not  present  for  (b=l.  the  DFP.  This  analysis  does  not  show  that  the 
DFP  fails  to  possess  the  same  global  and  local  conveigence  properties,  but  it  does  seem  to  point  out  a  fun¬ 
damental  deficiency  of  the  DFP  update. 


Byrd,  Nocedal,  and  Yuan  [1987]  also  provide  a  simple  example  that  shows  the  deterioration  of  the 
computational  {jerformance  of  secant  methods  as  (b  goes  from  0  to  1.  They  consider  the  function 

/(X)  =  y  X^X  +(0.1)(yX^ 

with  the  starting  values  Xo  =  ,//<,=  k  ■  Table  3.9  shows  the  number  of  iteraiiorts 

S1Z7  /C/  ^  it/ 
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required  by  an  unconstrained  optimization  method  with  a  modem  line  search  to  achieve 
II  zt  II  ^10~*ll  Xo  II  for  various  values  of  This  example  clearly  exhibits  a  deterioration  in 
efficiency  as  goes  from  0  to  1,  but  shows  that  this  deterioration  is  most  marked  very  close  to  the  DFP 


The  analysis  and  computational  example  of  Byrd,  Nocedal,  and  Yuan  [1987]  also  naturally  suggest 
that  one  might  try  values  of  (|><  0.  This  possibility  has  been  investigated  in  a  paper  by  Thang  and  Tewar- 
son  [1986].  They  suggest  a  heuristic  for  chosing  a  value  of  (^  <  0  in  (3.2),  and  show  that  on  a  set  of  test 
problems,  their  method  is  about  10%  to  15%  more  efficient  than  the  BFGS  on  the  average.  They  are  able 
to  prove  global  and  r -linear  convergence  as  long  as  9t  >  c*,  where  c*<0  is  a  quantity  that  is  computable 
in  practice,  but  can  only  show  superiinear  convergence  if  9*  ^  c* ,  where  c*  <0  is  not  computable  in  prac¬ 
tice.  Thus,  their  computational  and  theoretical  results  are  very  interesting,  but  given  the  Juristic  nature 
of  the  choice  of  <t>*  and  the  lack  of  a  fully  satisfactory  superiinear  convergence  result,  there  is  probably 
not  yet  strong  enough  computational  evidence  to  warrant  switching  from  the  BFGS  to  their  new  method. 

It  will  be  seen  that  the  same  contrast,  between  slightly  improved  computational  performance  and 
the  lack  of  fully  satisfactory  superiinear  convergence  results,  exists  for  some  methods  to  be  discussed  in 
Section  3.2. 

32  Recent  Research  on  the  Symmetric  Rank  One  Update 

It  has  long  been  known  that  there  is  one  rank  one  update  that  satisfies  the  secant  equation  (2.3)  and 
preserves  symmetry.  This  update  is  known  as  the  symmetric  rank  one  (SRI)  update. 


Table  3.9  -  Iterations  required  for  1 1  z*  1 1  ^  10^  1 1  Zo  1 1  on  Problem  3.8 
(from  Byrd,  Nocedal,  Yuan  [1987]) 


«D 

0 

0.2 

0.4 

0.6 

0.8 

0.9 

0.99 

0.999 

1 

iterations 

15 

21 

26 

32 

66 

115 

630 

2223 

4041 
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'  (y-HcS)(y-Hesf 
(y-Hesfs 


(3.10) 


While  this  update  has  not  been  used  much  in  practice,  some  recent  research  is  leading  to  a  revival  of 
interest  in  it.  In  this  section  we  briefly  review  the  properties  of  this  update,  and  then  discuss  this  recent 
research. 


It  is  straightforward  to  show  that  the  SRI  update  is  a  member  of  the  Broyden  class  (3.2),  with 


4»  = 


(y-Hcsfs 


(3.11) 


This  value  of  <j)  is  always  outside  the  range  [0,1],  as  long  as  He  is  positive  definite  and  s^y>0.  Thus  the 
SRI  update  is  not  covered  by  the  convergence  theory  mentioned  in  Section  2  or  3.1,  nor  is  it  guaranteed 
to  be  positive  definite  even  if  s^y>0.  These  properties  figure  prominently  in  the  main  advantages  and 
disadvantages  of  the  update. 


The  SRI  update  has  two  main  advantages.  First,  it  is  a  rank  one  modification  whereas  all  the  oflier 
members  of  the  Broyden  class  are  rank  two  modifications;  this  may  make  it  cheaper  to  implement. 
Second,  it  is  well  known  that  the  SRI  possesses  quadratic  termination,  meaning  that  if  it  is  applied  to  a 
quadratic  function  /  (x)  and  the  step  -He~^  Vfixe)  is  used  at  each  iteration,  then  in  exam  arithmetic,  the 
minimizer  will  be  found  exactly  in  n+l  or  fewer  iterations.  Furthermore,  if  n+l  iterations  are  required, 
then  the  final  Hessian  approximation  will  equal  the  exact  Hessian  V^f  (x ).  It  can  be  shown  that  no  update 
in  the  Broyden  class  that  always  preserves  positive  definiteness  has  this  quadratic  termination  property. 
This  at  least  raises  the  possibility  that  the  SRI  update  may  produce  more  accurate  Hessian  approxima¬ 
tions  than  other  updates  on  general  functions,  and  that  it  may  have  attractive  local  convergence  proper¬ 
ties. 


On  the  other  hand,  the  SRI  update  has  several  disadvantages.  First,  there  is  no  reason  why  the 
denominator  (y—HcV  s  cannot  be  zero  or  nearly  zero,  even  close  to  the  solution.  This  indicates  a  poten¬ 
tial  instability  in  the  update.  Secondly,  aside  from  the  quadratic  termination  result  mentioned  above,  no 
global  or  local  convergence  results  analogous  to  those  mentioned  in  Sections  2  and  3.1  have  been  esta¬ 
blished  for  the  SRI. 

Finally,  we  have  already  said  that  the  SRI  will  not  necessarily  yield  a  positive  definite  //+  even 
when  s  >0.  This  could  be  an  advantage  if  it  allows  the  update  to  better  model  the  actual  Hessian  when 
it  is  indefinite,  or  it  could  be  a  disadvantage  if  it  leads  to  an  indefinite  approximation  in  a  region  where 
the  actual  Hessian  is  positive  definite. 
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The  revival  of  interest  in  the  SRI  update  was  started  by  the  research  of  Conn,  Gould,  and  Toint 
[1986,  1987,  1988].  The  main  focus  of  their  research  is  somewhat  different  than  that  considered  in  this 
pa^r.  Conn,  Gould,  and  Toint  consider  the  bound-constrained  optimization  problem 

minimize /(z)  subject  to /,  Sz.  ^u,- ,i  =  l,...,rt  .  (3.12) 

Their  research  has  many  interesting  and  novel  aspects,  including  the  generalization  of  the  notion  of  a 
Cauchy  point  for  unconstrained  optimization  to  the  problem  (3.12),  the  use  of  inexaa  Newton  methods 
(methods  that  solve  the  linear  system  of  equations  associated  with  each  iteration  inexactly)  in  the  context 
of  problem  (3.12),  the  introduction  of  new  techniques  that  allow  large  changes  in  the  set  of  active  con¬ 
straints  at  each  iteration,  and  the  extension  of  the  known  global  convergence  theory  for  trust  region 
methods  for  imconstrained  optimization  to  problem  (3.12).  We  will  not  discuss  these  aspects  of  their 
research  further  since  they  are  outside  the  scope  of  this  paper. 

The  part  of  the  research  of  Corm,  Gould,  and  Toint  that  interests  us  most  from  the  perspective  of 
this  paper  is  one  of  their  computational  experiments.  Conn,  Gould,  and  Toint  [1986]  ran  their  algorithm 
for  problem  (3.12)  (a  trust  region  method  using  secant  updates  and  an  inexact  Newton  method)  on  50 
problems,  of  which  15  are  unconstrained.  They  tried  both  the  BFGS  and  the  SRI  updates.  A  stunmary  of 
their  results  is  given  in  Table  3.13. 

The  results  of  Conn,  Gould,  and  Toint  [1986]  show  a  large  overall  advantage  for  the  SRI  update  in 
comparison  to  the  BFGS.  The  advantage  is  great  on  problems  where  bounds  are  present,  while  the  two 
updates  appear  similar  on  unconstrained  problems. 

These  unconstrained  optimization  results  interested  us  considerably,  because  if  the  SRI  is  even 
competitive  with  tne  BFGS  in  general,  then  there  are  situations  where  it  may  be  preferable  due  to  its 
simpler  form,  quadratic  termination,  and  its  ability  to  reflect  indeflniteness.  Thus  we  decided  to  experi¬ 
ment  with  using  the  SRI  update  instead  of  the  BFGS  update  in  the  UNCMIN  code  (Schnabel,  Koontz, 
and  Weiss  [1985]),  a  fairly  standard  unconstrained  optimization  method.  The  results  of  running  UNC¬ 
MIN,  using  the  BFGS  and  the  SRI,  on  the  same  unconstrained  problems  as  were  used  by  Conn,  Gould, 
and  Toint  [1986]  are  shown  in  Table  3.14. 

The  UNCMIN  results  in  Table  3.14  are  very  different  than  the  unconstrained  optimization  results  in 
Table  3.13,  with  the  UNCMIN  results  strongly  favoring  the  BFGS  over  the  SRI.  The  reasons  for  the 
d  '^erence  in  the  comparative  performance  of  the  BFGS  and  SRI  updates  within  the  algorithms  of  Conn, 
Gould,  and  Toint  [  1986]  and  in  UNCMIN  are  not  clear.  It  should  be  stressed  that  these  two  algorithm  are 
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Table  3.13  -  Computational  Results  from  Conn,  Gould,  and  Toint  [1986] 


Problems 

BFGS 

Better 

SRI 

Better 

BFGS,  SRI 
Similar 

total  SRI  iterations 

total  BFGS  iterations 

All  (50) 

9 

30 

11 

0.71 

Unconstrained  (15) 

6 

1 

7 

2 

1.12 

Bounds  Present  (35) 

3 

23 

9 

0.56 

Table  3.14  -  UNCMIN  Results  on  the  Unconstrained  Problems  from  Table  3.13 


Global 

BFGS 

SRI 

BFGS.  SRI 

total  SRI  iterations 

Method 

Better 

Better 

Similar 

total  BFGS  iterations 

Trust  Region 

10 

1 

3 

2.66 

Line  Search 

12 

0 

2 

1.93 

considerably  different.  Most  importantly.  Conn,  Gould,  and  Toint  [1986]  use  an  inexact  Newton  strategy 
while  UNCME^  finds  the  minimizer  of  the  quadratic  model  exactly.  It  is  also  important  to  mention  that 
the  performances  of  the  BFGS  versions  of  the  two  algorithms  arc  fairly  similar  the  big  difference 
between  the  unconstrained  optimization  results  in  Tables  3.13  and  3.14  stems  from  the  faa  that  the  Conn, 
Gould,  and  Toint  algorithm  performs  much  better  using  the  SRI  than  UNCMIN  does  using  the  SRI.  We 
do  not  yet  understand  why  this  is  so,  nor  why  the  comparative  advantage  of  the  SRI  over  the  BFGS  in 
Conn,  Gould,  and  Toint’s  tests  is  so  much  bigger  for  problems  with  simple  bounds.  However  all  these 
results  do  seem  to  indicate  that  the  SRI  update  in  general,  and  the  above  questions  in  particular,  warrant 
additional  research. 

Another  interesting  aspect  of  the  research  of  Conn,  Gould,  and  Toint  [1987]  is  an  examination  of 
the  convergence  of  the  sequence  of  matrices  generated  by  the  SRI  update.  They  show  that  if  the 
sequence  of  iterates  (x* }  converges  to  a  strong  local  minimizer  x. ,  if  each  set  of  n  consecutive  steps 
(x,+i-x, },  i=k,  ■  ■  ■ ,  k-i-n-l  is  uniformly  linearly  independent,  and  if  the  denominators  of  (3.10)  are 
bounded  below  in  the  sense  that  I  (y* -//*  s*  I  II  II  II  for  all  /fc,  then  the 
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sequence  of  Hessian  approximations  [Hk }  generated  by  the  SRI  algorithm  converges  to  V^/  (x* ).  This  in 
turn  implies  that  die  rate  of  local  convergence  is  at  least  supei.  lear.  While  these  assumptions  are  strong, 
and  can  probably  not  be  guaranteed  to  be  satisfied  in  theory,  this  convergence  result  still  gives  an  indica¬ 
tion  of  what  might  often  happen  in  practice.  Indeed,  Conn.  Gould,  and  Toint  [1987]  conduct  some  exper¬ 
iments.  using  a  fourth  order  polynomial  for  /(x)  and  running  to  a  very  tight  convergence  tolerance, 
where  they  show  that  the  SRI  produces  final  Hessian  approximations  that  agree  with  the  actual  Hessian  at 
the  solution  to  within  between  KT*  and  whereas  the  final  approximations  produced  by  the  BFGS 
only  agree  to  about  10“^.  This  research  supports  the  hypothesis  that  the  quadratic  termination  property  of 
the  SRI  might  lead  to  better  final  Hessian  approximations  in  practice.  It  also  seems  to  further  indicate  a 
need  for  continued  research  on  the  role  of  the  SRI  update  in  unconstrained  optimization. 


We  conclude  this  section  by  discussing  a  second,  somewhat  different,  new  algorithm  involving  the 
SRI  update  that  was  recently  proposed  by  Osborne  and  Sun  [1988],  motivated  in  part  by  the  work  of 
Conn,  Gould,  and  Toint.  Osborne  and  Sun's  approach  is  to  use  the  SRI  in  such  a  way  that  it  always  pro¬ 
duces  positive  definite  Hessian  approximations.  They  do  this  by  first  multiplying  the  current  Hessian 
approximation  He  by  a  scale  factor  y>  0.  and  then  applying  the  SRI  update  to  y/Zc .  That  is 


)  (y-yHcS)(,y-yHcsf 
(y-yHesf  s 


(3.15) 


It  is  fairly  easy  to  see  that  if  //«  is  positive  definite  and  r^y>0,  then  H+  given  by  (3.15)  will  be 
positive  definite  if  y  is  either  sufficiently  large  or  sufficiently  close  to  0.  In  fact,  Osborne  and  Sun  [1988] 
show  that  //+  is  positive  definite  if  j  ^  y  >0  and 

ye  (0,-f^  )  or  ye  ,  oo).  (316) 

s'  He  s  s'  y 

Note  that  the  standard  SRI  update,  y=  1,  may  or  may  not  be  contained  in  one  of  the  two  intervals  in 
(3.16). 

Osborne  and  Sun  [1988]  propose  using  the  standard  SRI  update,  y=  1  in  (3.15),  if  it  satisfies  (3.16). 
Otherwise,  they  propose  choosing  the  value  of  y  that  satisfies  (3.16)  and  that  leads  to  the  optimally  condi¬ 
tioned  update  in  the  sense  proposed  by  Davidon  [1975],  namely  that  it  minimizes  the  1 2  condition  number 
of  He  ~^'^H+Hc  among  all  the  updates  of  the  form  (3.15).  Osborne  and  Sun  derive  a  closed  form  for 
this  optimal  value  of  y;  actually  there  are  two  values  of  y  that  yield  equally  optimal  solutions,  one  in  each 
of  the  intervals  in  (3.16). 
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Osborne  and  Sun  report  promising  computational  results  using  this  scaled  SRI  method  on  a  small 
set  of  test  problems.  We  have  tried  using  their  update  in  the  UNCMIN  code,  and  have  found  that  on  the 
average  it  leads  to  a  10-15%  improvement  over  the  BFGS  update  on  a  standard  set  of  test  problems. 
Therefore,  since  the  Osborne  and  Sun  algorithm  has  the  additional  advantage  in  comparison  to  die  BFGS 
that  it  only  requires  a  rank  one  update,  it  appears  to  merit  further  consideration.  On  the  other  hand,  the 
scaled  SRI  update  (3. IS)  shares  with  the  standard  SRI  update  (3.10)  the  apparent  disadvantage  that  no 
global  or  local  convergence  results  have  been  proven  for  it  under  the  standard  assumptions  that  are  used 
in  the  convergence  analysis  of  many  secant  methods,  including  the  BFGS  (see  Section  3.1).  Thus  more 
research  seems  necessary  to  understand  both  the  pracdcal  and  theoretical  properties  of  all  the  SRI 
methods  discussed  in  this  section. 


4.  Recent  Research  on  Second  Derivative  Methods  ~  Tensor  Methods 

In  this  section  we  turn  our  attention  to  second  derivative  methods,  methods  where  the  analytic  or 
finite  difference  Hessian  is  available  at  each  iteration.  We  discuss  one  recent  development,  the  develop¬ 
ment  of  tensor  methods.  This  is  a  class  of  methods  that  bases  each  iteration  upon  a  higher  order  model 
than  is  used  by  standard  methods.  The  higher  order  terms  in  this  model  are  chosen  so  that  the  model  is 
hardly  more  expensive  to  form,  store,  and  solve  than  the  standard  model. 

Tensor  methods  were  first  developed  by  Schnabel  and  Frank  [1984]  in  the  context  of  solving  sys¬ 
tems  of  nonlinear  equations.  These  methods  base  each  iteration  upon  a  quadratic  model,  rather  than  the 
linear  model  that  is  standard  for  solving  systems  of  nonlinear  equations.  Since  an  understanding  of  tensor 
methods  for  nonlinear  equations  is  helpful  in  understanding  the  more  complicated  tensor  methods  for 
unconstrained  optimization,  we  review  tensor  methods  for  nonlinear  equations  in  Section  4.1.  Then  in 
Section  4.2  we  describe  tensor  methods  for  unconstrained  optimization.  These  methods  base  each  itera¬ 
tion  upon  a  fourth  order  model,  rather  than  the  standard  quadratic  model  (2.6). 

We  note  that  another  recent  body  of  research  has  considered  the  use  of  higher  ordered  models  when 
the  objective  function  is  a  "factorable  function.”  See  for  example  Jackson  [1983],  Jackson  and  McCor¬ 
mick  [1986],  and  McCormick  [1983]. 
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4.1  Tensor  Methods  for  Nonlinear  Equations 
The  nonlinear  equations  problem  is 

given/" :/?" ,  find  X*  for  which /■(x»)  =  0  .  (4.1) 

When  the  Jacobian  matrix  F\xc)  is  available,  algorithms  for  solving  (4.1)  generally  base  each  iteration 
upon  the  linear  model 

A/(Xc+d)  =  F(Xc)  +  F'(x,)d  .  (4.2) 

This  model  requires  storage  locations,  and  arithmetic  operations  to  solve  it  at  each  iteration.  Its 

use  leads  to  quadratic  convergence  for  problems  where  F\x»)  is  non-singular,  but  at  best  linear  conver¬ 
gence  for  problems  where  F'(x»)  is  singular  (see  for  example  Decker  and  Kelly  [1980a,  b],  Griewank 
[1980]). 

The  tensor  method  proposed  by  Schnabel  and  Frank  [1984]  instead  bases  each  iteration  upon  the 

model 

M  ixc  +d)  =  F(jCc)  +  F\Xc)d  +'/iTcdd  (4.3) 

where  Te  €  R'»^^  is  a  three-dimension^  object  often  referred  to  as  a  tensor.  If  TV  =F'\xe),  then  (4.3) 
is  just  a  second  order  Taylor  series  model.  However  using  (4.3)  with  Te  -F'\xe)  is  not  practical,  as  it 
leads  to  huge  increases  in  the  costs  to  form,  store,  and  solve  the  model.  Instead,  Schnabel  and  Frank 
chose  Te  in  (4.3)  to  be  a  very  low  rank  approximation  to  F'\xe).  We  now  briefly  summarize  how  they 
make  this  choice,  and  some  of  its  consequences. 

Schnabel  and  Frank  [1984]  choose  Te  in  (4.3)  by  requiring  the  model  to  inteipolate  the  values  of 
F(x)  at  p  (not  necessarily  consecutive)  previous  iterates  x_i ,  ■  •  •  ,x-p.  They  impose  the  limit  p^'^n  , 
and  also  require  that  the  steps  5,-  =  Xc-x^  from  Xe  to  these  p  previous  iterates  be  strongly  linearly 
independent.  The  latter  condition  is  usually  much  more  restrictive  than  the  limit  p^^ ,  and  most  often 
results  in  the  choice  p=l,  meaning  that  only  information  from  the  most  recent  previous  iterate  is  used. 
Schnabel  and  Frank  then  choose  the  smallest  Te ,  in  the  Frobenius  norm,  that  satisfies  the  p  interpolation 

conditions  M (xc  -Si)-F{x-i),  r=l,  ■  ■  ,p.  The  result  is  a  rank  p  tensor  Te  of  the  form  Te  -  r,-  s,- , 

for  some  a,  e  Z?" ,  r  =  1 ,  ■  ■  •  ,p .  Thus  the  model  (4.3)  becomes 

M{Xe^d)  =  F{Xe)^F  Xxe  )d  '/i  ^  fl,-  {s^d  .  (4.4) 

The  additional  cost  of  forming  this  model  is  about  n^p  arithmetic  operations,  while  the  additional  storage 


cost  is  about  4/^  locations.  Both  of  these  are  small  in  comparison  to  the  basic  0(jt^)  arithmetic  per 
iteration  and  n?-  storage  costs  of  the  linear  model. 


Due  to  the  special  fonn  of  the  model  (4.4),  the  problem  of  finding  its  root  (or  of  minimizing 
1 1  Af  (j:c  +  d)  1 1  f  if  there  is  no  root )  can  be  reduced  to  solving  p  quadratic  equations  in  p  unknowns  plus 
n  -p  linear  equations  in  n-p  unknowns.  This  is  hardly  more  expensive  than  finding  the  root  of  the  linear 
model  (4.2),  requiring  only  about  n?-p  additional  arithmetic  operations  (recall  that  usually  p=l).  Further¬ 
more.  Schnabel  and  Frank  [1984]  show  that  the  solution  of  (4.4)  is  usually  well  posed  as  long  as  the  rank 
of  F\Xc)  is  at  least  n-p ,  whereas  the  solution  of  the  linear  model  (4.2)  only  is  well  posed  if  rank  {F\xc)) 
=  n. 


The  above  simply  shows  that,  by  utilizing  a  low  rank  quadratic  term,  it  is  possible  to  add  a  small 
amount  of  infonnation  to  the  linear  model  at  a  small  cost  What  is  perhaps  surprising  is  that  this  small 
amount  of  information  seems  to  lead  to  fairly  large  improvements  in  the  cost  of  solving  problem  (4.1)  in 
practice.  Schnabel  and  Frank  [1984]  compare  an  implementation  of  their  tensor  method,  using  a  standard 
line  search,  to  a  standard  method  for  nonlinear  equations  that  uses  the  linear  model  (4.2)  and  the  same 
line  search,  on  a  set  of  problems  from  Mord,  Garbow,  and  Hillstrom  [1981].  Their  results  are  summar¬ 
ized  in  Table  4.5;  for  more  details,  see  Schnabel  and  Frank  [1984]. 

Table  4.5  indicates  that  the  tensor  method  leads  to  consistent  and  rather  substantial  improvements 
in  efficiency  over  a  standard  derivative-based  method  for  solving  systems  of  nonlinear  equations. 


Table  4^  -  Comparison  of  Tensor  and  Standard  Methods  for  Nonlinear  Equations 

(from  Schnabel  and  Frank  [1984]) 


Rank 

F'(x•^ 

Problem 

Set 

Tensor 

Better 

Standard  i 
Better  i 

Two  Methods 
similar 

Average  Ratio  of  Tensor 
Iterations/Standard  Iterations 

n 

All 

21 

1  1 

6 

0.77 

Harder  Onlv* 

14 

0  1 

0 

0.61 

n-\ 

AU 

15 

0  1 

2 

0.58 

Harder  Only* 

9 

0  i 

0 

0.39 

n-2 

aU 

11 

2  1 

0 

0.63 

■  ■  ■  i 

7 

0  i 

0 

0.50 

♦Only  those  problems  where  slower  method  required  at  least  10  iterations. 
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Furthennore,  in  these  tests  the  number  of  past  points,  p ,  used  in  the  tensor  model  was  generally  1  and 
never  more  than  3,  so  that  the  additional  cost  of  using  the  tensor  method  was  low.  A  Fortran  code  for 
solving  systems  of  nonlinear  equations,  and  nonlinear  least  squares  problems,  by  tensor  methods  is  avail¬ 
able  from  the  author.  Our  positive  experience  with  tensor  methods  for  nonlinear  equations  also  has 
motivated  us  to  consider  using  tensor  methods  for  unconstrained  optimization,  which  are  described  in  the 
next  sectioa 

4J,  Tensor  Methods  for  Unconstrained  Optimization 

The  extension  of  the  tensor  method  for  nonlinear  equations  that  we  have  just  described  into  a  tensor 
method  for  unconstrained  optimization  brings  up  an  interesting  general  issue.  In  some  cases,  such  as  the 
basic  Newton  method,  methods  for  nonlinear  equations  and  for  unconstrained  optimization  are  very 
closely  related.  In  soaie  other  cases,  such  as  secant  methods,  methods  for  nonlinear  equations  and  uncon¬ 
strained  optimization  are  considerably  different.  These  differences  are  generally  due  to  the  considerations 
of  symmetry,  convexity,  and  positive  definiteness  which  are  present  for  unconstrained  optimization  but 
not  for  nonlinear  equations.  In  the  case  of  tensor  methods,  the  differences  between  unconstrained  optimi¬ 
zation  and  nonlinear  equations  cause  the  tensor  methods  for  the  two  problems  to  differ  significantly. 

The  obvious  extension  of  the  tensor  method  for  nonlinear  equations  to  the  unconstrained  optimiza¬ 
tion  problem  would  be  to  base  each  iteration  upon  the  quadratic  model  of  the  gradient 

VmiXc+d)  =  Vf{Xc)  +  '^'^f(Xc)d  +  '/iTcdd  .  (4.6) 

that  would  result  from  simply  substituting  Vf(x)  forF(x)  in  the  method  of  Section  4.1.  This  model  is 
unappealing  for  unconstrained  optimization  for  several  reasons.  First,  the  tensor  Tc  produced  by  the 
method  described  in  Section  4. 1  is  not  symmetric,  but  for  optimization  it  should  be.  Second,  we  will 
want  to  interpolate  past  values  of  /  (x)  as  well  as  V/(x),  so  a  procedure  based  solely  on  modeling  V/  (x) 
is  too  restrictive.  Most  importantly,  the  model  (4.6)  corresponds  to  using  a  third  order  model  of  /  (Xc), 
and  a  third  order  model  has  two  basic  deficiencies  for  unconstrained  optimization.  One  is  that  it  does  not 
have  a  global  minimizer.  A  second  is  that  it  does  supply  enough  information  to  lead  to  faster  than  linear 
convergence  for  problems  where  V^/  (x* )  is  singtilar.  For  this,  fourth  order  information  is  necessary  as 
well. 

For  these  reasons,  Schnabel  and  Chow  [1988]  propose  using  a  fourth  order  model  of/(x) 

M(XcHi)=/(Xc)  +  Vf(Xcfd  +  j-drV2/(x,)d  +  ^T,ddd  +  -^V,dddd  (4.7) 

where  Tc  e  is  a  symmetric  approximation  to  V^/(Xc)  and  Vc  e  is  a  symmetric 
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approximation  to  VY  (xe ).  While  this  model  may  appear  complicated,  the  cnicial  point  is  that  Tc  and 
will  again  be  low  rank  tensors.  We  will  see  that  this  again  makes  the  model  hardly  mote  expensive  to 
form,  store,  and  solve  than  the  standard  quadratic  model. 

To  form  the  tensor  model  (4.7),  Schnabel  and  Chow  require  it  to  interpolate  the  values  of  /  (x)  and 
V/  (x)  at  p  previous  iterates  x_, .  i =1,  •  •  •  .p .  These  iterates  are  chosen  by  the  same  criteria  as  in  the  ten¬ 
sor  method  for  nonlinear  equations,  namely  that  p  and  that  the  steps  r,-  =  Xe-x.,-  are  strongly  linearly 
independent.  In  practice,  usually  the  second  criterion  limits  p  to  1,  meaning  that  only  information  from 
the  most  recent  previous  iteration  is  used.  Next  Schnabel  and  Chow  choose  the  smallest  ,  in  the  Fro- 
benius  norm,  that  is  consistent  with  these  interpolation  conditions.  The  result  is  a  symmetric  rank  p  ten¬ 
sor  Vc  of  the  form 

~  *  (4.8) 

where  each  ot,  is  a  scalar.  Finally  they  choose  to  be  the  smallest  symmetric  tensor,  in  the  Frobenius 
norm,  that  is  consistent  with  the  choice  of  the  and  the  interpolation  conditions.  This  yields  a  rank  2p 
tensor  Tc  of  the  form 

Tc  ~  +SibiSi  "^SiSibi)  (4.9) 

where  each  6,  €  /? " .  This  order  of  choosing  Vc  and  Tc  causes  the  model  to  interpolate  as  much  informa¬ 
tion  as  possible  with  a  third  order  model,  and  to  use  the  fourth  order  term  only  where  a  third  order  model 
is  insufftcient 

Even  though  Tc  is  a  rank  2p  tensor,  Schnabel  and  Chow  [1988]  show  that  the  minimizer  of  the 
resultant  tensor  model  (4.7)  can  still  be  found  by  solving  p  cubic  equations  in  p  unknowns  and  n-p 
linear  equations  in  n-p  unknowns.  Thus  the  size  of  the  system  of  nonlinear  equations  that  must  be 
solved  to  mmimize  the  tensor  model  for  unconstrained  optimization  is  the  same  as  for  nonlinear  equa¬ 
tions,  with  the  difference  being  that  the  p  nonlinear  equations  are  cubic  rather  than  quadratic.  Since  p 
generally  is  1,  this  is  not  significant.  The  total  additional  costs  of  using  the  tensor  model  rather  than  the 
standard  quadratic  model  are  again  0{np)  storage  locations,  and  O(n^p)  arithmetic  operations  per  itera¬ 
tion.  This  is  again  minor  in  comparison  to  the  storage  locations  and  at  least  arithmetic  operations 
per  iteration  that  are  required  by  the  standard  method. 

Schnabel  and  Chow  [1988]  compare  an  implementation  of  their  tensor  method  for  unconstrained 
optimization  to  a  standard  method  based  on  a  quadratic  model,  on  the  problems  from  Moid,  Garbow.  and 
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Hillstrom  [1981].  Both  algorithms  use  the  same,  fairly  standard  trust  region  strategy.  A  surmnary  of 
their  computational  results  is  given  in  Table  4.10.  The  lines  labeled  "p  ^1"  show  the  results  when  p ,  the 
number  of  previous  iterates  whose  function  and  gradient  is  interpolated  is  interpolated,  is  aEowed  to 
exceed  one,  while  the  lines  labeled  "p=r  show  the  results  when  only  informadon  from  the  most  recent 
iterate  is  used. 

These  results  indicate  that  once  again,  the  tensor  method  seems  to  obtain  a  considerable  improve¬ 
ment  in  efficiency  from  using  a  rather  small  addidonal  amount  of  infoimauon.  Indeed,  the  tensor  method 
for  unconstrained  optimization  rarely  chooses  p>l  even  when  we  allow  this  possibility,  and  Table  4.10 
shows  if  we  require  p=l.  then  the  results  do  not  change  appreciably.  Thus  it  appears  that  we  can  achieve 
substantial  savings  at  a  very  low  cost 

It  is  not  yet  clear  to  us  why  tensor  methods  for  both  unconstrained  optimization  and  nonlinear 
equations  achieve  rather  large  improvements  in  efficiency  from  utilizing  a  rather  small  additional  amount 
of  informatiort  For  one  class  of  problems,  nonlinear  equations  with  rank(F'(x»))  =  n-l,  the  observed 
improvement  is  probably  due  to  a  faster  local  convergence  rate,  since  Frank  [1984]  shows  that  the  tensor 
method  achieves  3-sicp  superlinear  convergence  as  opposed  to  the  linear  rate  of  standard  methods.  This 
pattern  of  convergence  is  observed  in  practice,  and  may  also  occur  in  the  analogous  case  for  uncon¬ 
strained  optimization.  But  the  rate  of  convergence  of  the  tensor  methods  is  almost  certainly  not  superior 
to  :..andard  methods  when  F'(x* ),  or  V*/  (x. ),  is  non-singular,  and  yet  the  tensor  methods  usually  arc 
considerably  faster  in  these  cases  too.  Our  conjecture  is  that  this  improvement  occurs  because  the 


Table  4.10  --  Comparison  of  Tensor  and  Standard  Methods  for  Unconstrained  Optimization 

(from  Schnabel  and  Chow  [1988]) 


Rank 

V^f(x.) 

Value  of  p  in 
Tensor  Method 

Tensor 

Better 

Standard 

Better 

Two  Methods 
Similar 

Average  Ratio  of 
Tensor  Iterations  to 
Standard  Iterations 

n 

p  >1 

27 

4 

5 

0.65 

p=l 

27 

1 

7 

0.62 

n—\ 

p  >1 

27 

2 

4 

0.56 

1  p  =  1 

30 

3 

2 

0.55 

n-2  1  p  ^  1 

26 

2 

4 

0.59 

_ ^ _ 

28 

5 

_ 3 _ 

0.58 
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directions  of  two  consecutive  steps  are  often  rather  similar  in  practice,  so  that  the  additional  information 
provided  by  the  tensor  model,  which  is  along  the  previous  step  direction,  turns  out  to  be  especiaUy  useful. 
In  any  case,  it  appears  to  us  that  the  use  of  higher  order  models  seems  fhiitful.  and  that  additional 
research  on  such  approaches  is  warranted. 


5.  Parallel  Methods  for  Unconstrained  Optinuzation 

It  is  becoming  clear  that  the  fastest  computers  in  the  future  wUl  be  parallel  computers.  Therefore,  it 
is  namral  ^hat  there  has  been  increased  interest  recently  in  designing  parallel  optimization  algorithms.  In 
this  section  we  discuss  one  aspect  of  this  work,  parallel  methods  for  the  general  unconstrained  optimiza¬ 
tion  problem.  Some  of  this  research  will  be  seen  to  consist  of  the  development  of  new  algorithms 
motivated  by  the  consideration  of  parallel  computers,  while  other  parts  simply  consist  of  determining 
good  ways  to  parallelize  existing  sequential  methods. 

Before  we  begin,  we  need  to  clarify  which  of  the  various  types  of  parallel  computers  we  are  con¬ 
cerned  with.  At  a  high  level,  currently  available  parallel  computers  can  be  divided  into  three  categories, 
vector  computers,  processor  arrays,  and  multiprocessors.  Vector  computers  can  perform  pairwise  addi¬ 
tion  or  multiplication  of  long  vectors  of  numbers  quickly.  They  are  best  suited  to  a  low  level,  fine  grain 
type  of  parallelism.  Processor  arrays  (SIMD  computers)  are  computers  with  many  processors  that  can 
perform  the  same  operation  in  lock  step  on  multiple  data  at  the  same  time.  Thus  they  are  also  suited  to  a 
fairly  low  level,  fine  to  medium  grain  type  of  parallelism.  By  multiprocessors  (MIMD  computers),  we 
mean  computers  that  can  perform  entirely  different  operations  on  different  data  at  the  same  time.  These 
include  both  shared  memory  multiprocessors,  where  all  the  processors  share  access  to  some  of  the 
memory,  and  distributed  memory  multiprocessors,  where  each  processor  has  its  own  memory  and  there  is 
no  shared  memory.  These  computers  support  a  higher  level,  medium  to  coarse  grain  type  of  parallelism. 

Among  these  various  types  of  parallel  architectures,  the  type  of  optimization  problems  we  consider 
in  this  paper  seem  best  suited  to  solution  on  (MIMD)  multiprocessors.  This  is  because  parallel  algorithms 
for  general  unconstrained  optimization  problems  generally  seem  to  entail  a  high  level,  coarse  grain  type 
of  parallelism.  Largely  this  is  because  the  evaluation  of  the  objective  function  /(x),  an  atomic  operation 
in  these  algorithms,  is  itself  often  a  lengthy  calculation.  Furthermore,  if  we  wish  to  perform  two  or  more 
computations  of  /(x)  concurrently,  this  will  often  require  an  MIMD  computer  since  for  two  different 
values  of  X ,  the  program  that  evaluates  /  (x )  may  well  execute  different  sequences  of  instructions.  Thus 
the  discussion  in  the  remainder  of  this  section  is  mainly  oriented  towards  parallel  computation  on 


multiprocessois,  though  the  specific  type  of  multiprocessor,  for  example  shared  or  distributed  memory,  is 
relatively  unimportant 

The  remainder  of  this  section  will  discuss  parallel  methods  that  are  related  to  the  most  commonly 
used  general  purpose  unconstrained  optimization  method,  the  BFGS  method  with  a  line  search.  A  high 
level  description  of  this  method  is  given  in  Algorithm  (5.1). 

Since  presumably  we  are  interested  in  parallel  optimization  in  order  to  solve  expensive  problems, 
we  need  to  focus  on  why  the  BFGS  method  can  be  expensive.  There  are  two  main  reasons.  One  is  that 
the  function  and  derivative  evaluations  are  expensive.  The  second  is  that  the  linear  algebra  computations, 
namely  updating  the  Hessian  approximation  and  calculating  the  search  direction,  are  expensive  because 
the  number  of  variables  is  large. 

There  are  several  obvious  possibilities  for  adapting  these  potentially  expensive  portions  of  the 
BFGS  algorithm  to  parallel  computers.  The  first  is  to  parallelize  the  individual  calculations  of  the  objec¬ 
tive  function  /  (x).  This  may  or  may  not  be  feasible,  depending  upon  the  form  of  f  (x),  the  availability  of 
pre-existing  parallel  routines  to  calculate  it,  or  the  interest  of  the  user  in  devising  new  parallel  routines  to 
evaluate  it.  In  any  case,  it  is  outside  the  scope  of  basic  optimization  research.  Thus  for  the  remainder  of 
this  section  we  wiU  assume  that  /  (z)  is  evaluated  on  one  processor. 


Algorithm  5.1  -  Iteration  of  the  BFGS  Method  with  Line  Search 

given  current  iterate  ,  /  (xc), 

gc  =  V/  (;Cc )  or  finite  difference  approximation. 

He  =  symmetric  positive  definite  Hessian  approximation 

calculate  search  direction  dc : 

solve  He  de  =  -gc 
line  search; 

find  X  >  0  for  which  z*  =  Xc  +  V  dc  is  satisfactory  next  iterate 
evaluate  g+  =  ^f  (z+)  or  finite  difference  approximation  if  not  done  in  previous  step 
(/(z+)  is  evaluated  in  previous  step) 
decide  whether  to  stop;  if  not 

update  Hessian  approximation  by  BFGS  formula  (2.4): 

H^  =  He  +  rank  two  matrix 
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The  second  possibility  is  to  parallelize  the  dominant  linear  algebraic  computations  of  the  BFGS 
method,  mainly  the  Hessian  updates  and  the  calculations  of  the  step  directions.  We  consider  this  topic  in 
Section  5.1.  While  it  basically  consists  of  ways  to  parallelize  the  existing  method,  it  brings  up  some 
interesting  fundamental  issues  in  unconstrained  optimization. 

The  third  possibility  is  to  perform  multiple  evaluations  of  the  objective  function  /  (x)  concurrently. 
The  possibilities  here  range  from  calculating  several  components  of  the  finite  difference  gradient  con¬ 
currently  to  new  algorithms  that  make  use  of  concurrent  function  evaluations.  We  consider  both  of  these 
possibilities  in  Section  5.2. 

The  material  in  this  section  is  taken  largely  from  Schnabel  [1987]  and  Byrd,  Schnabel,  and  Shultz 
[1988a,  b]. 

5.1  Parallel  Methods  for  the  Linear  Algebra  Calculations  in  Secant  Methods 

In  this  section,  we  consider  the  paraUelizadon  of  the  linear  algebra  calculations  in  the  standard 
BFGS  method.  This  topic  is  interesting  because  it  leads  us  to  re-examine  various  ways  for  implementing 
these  calculations.  It  also  leads  us  to  interpret  a  recent  proposal  of  Han  [1986]  in  a  different  light 

It  is  convenient  to  view  the  linear  algebra  calculations  involved  in  an  iteration  of  the  BFGS  method 
as  the  update  of  the  Hessian  approximation  followed  by  the  calculation  of  the  next  search  direction. 
Table  5.2  summarizes  four  different  ways  of  implementing  these  calculations,  along  with  their  costs. 
These  four  options  arise  from  choosing  between  two  possibilities  each  for  two  orthogonal  attributes  of 
how  the  Hessian  approximation  may  be  stored.  First,  one  can  store  either  an  approximation  to  the  Hes¬ 
sian  itself  or  an  approximation  to  the  inverse  of  the  Hessian.  Second,  this  approximation  can  either  be 
kept  in  the  straightforward,  unfaaored,  form,  or  one  can  store  a  matrix  M  from  a  factorization  MM^  of 
the  Hessian  or  the  inverse  Hessian  approximatioa 

The  upper  left  variant  in  Table  5.2  is  the  most  obvious  way  to  implement  the  BFGS  method.  It 
simply  consists  of  performing  the  update  (2.4)  to  obtain  //+,  and  then  performing  a  Cholesky  factorization 
of  H+  to  solve  H^d^=-g^  for  .  This  is  the  only  variant  that  requires  0(,n^)  arithmetic  operations;  all 
the  others  require  only  O  (n‘). 


The  upper  right  and  lower  left  variants  in  Table  5.2  are  the  two  well-known  ways  for  performing 
the  linear  algebra  calculations  in  a  BFGS  algorithm  in  O(n^)  operations.  The  lower  left  variant  was  the 
first  known  O(n^)  implementation  of  the  BFGS;  it  consists  of  using  the  formula  that  gives  the  effect  of 
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Table  SJ.  -  Four  Possible  Implementations  of  the  Linear  Algebra  Calculations  : 

//t+i  =  //*  +  rank-two-matrix 
solve  Ht+i  dt+i  =  -gi+i  for  dt+i 

i 

1 

j 

Matrix  Stored  Unfactored 

Matrix  Stored  Faaored 

1 

(//*  stored,  updated  to  //t+i) 

1 

(Lk  lower  triangular  stored,  for  which  //*  =LkLj , 
updated  to  Lk.^\  lower  triangular  for  which 
^*+1  =  Lt+i  L/T+i  ) 

Direct 

(//*) 

//*+!  =Hic  +  rank-two 

Jk+\  =  L*  -1-  rank-one 

Update 

Cholesky  factor //*+! 

•^*+1  =  Qjk+i  Lt+1  by  Givens  rotations 

2  triangular  solves  to  find  <i*+i 

2  triangular  solves  to  find  dk+i 

6n^  (2.5n^) 

{HiT^  stored,  updated  to  ) 

(Mk  stored  for  which  //r’  =  Mk  Mi , 
updated  to  Af*+i  for  which  Hki\  =  Mk+\Mi^i ) 

Inverse 

Hir}[  =  //*"'  +  rank-two 

A/t+i  =:  Mk  +  rank-one 

Update 

Matrix-vector  multiply  to  find  dt+i 

1 

2  Matrix-vector  multiples  to  find  dk-t-i 

2n- 

4n2 

the  BFGS  update  on  the  inverses  of  and  H 
//Z'  =  //c"‘  + 


t  -  w  -I  j-  (■y-^c~*y  _  ss'^js-H^^yfy 


(5.3) 


and  then  multiplying  by  //+ '  to  get  The  upper  right  variant  was  subsequently  discovered  (Gill 
and  Murray  [1972],  Goldfarb  [1976]),  and  has  become  the  favored  0{n?-)  method  for  implementing  the 
BFGS.  It  consists  of  directly  updating  the  Cholesky  faaorization  LL"^  of  He  into  the  Cholesky  factoriza¬ 
tion  L+Ll  of  //*.  This  is  accomplished  by  expressing  the  BFGS  update  as  a  rank  one  change  to  the 
Cholesky  factor  L ,  resulting  in  a  non-tri angular  matrix  /+,  and  then  transforming  7+  into  a  lower  triangu¬ 
lar  matrix  L*  through  a  series  of  2rt-2  Given’s  rotations.  Then  d+  is  found  by  using  two  backsolves  to 
solve  L^Ll  d+=-g^  for  An  advantage  of  this  implementation  is  that  by  always  keeping  a  Cholesky 
factorization,  it  implicitly  guarantees  that  the  Hessian  approximation  is  numerically  positive  definite. 
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(For  details  on  the  operation  counts,  see  Byrd,  Schnabel,  and  Shultz  [1988b].) 

The  fourth,  lower  right,  variant  is  closely  related  to  the  upper  right  variant  It  consists  of  updating  a 
fattorization  of  directly  into  a  factorization  Af+A/+^  of  by  making  a  rank  one  change  to 
Af ,  and  then  multiplying  by  to  obtain  Since  the  savings  that  would  occur  in  these 

matrix  vector  multiplications  if  Af  +  were  triangular  would  be  exceeded  by  the  cost  of  restoring  Af+  to  tri¬ 
angularity  at  each  iteration,  it  is  preferable  to  omit  the  Given’s  rotations  that  are  used  in  the  upper  right 
variant  and  simply  store  A/  as  a  full  matrix. 

To  our  knowledge,  the  lower  right  variant  has  hardly  been  considered  in  the  form  in  which  we  have 
just  described  it,  but  it  is  equivalent  to  a  suggestion  of  Han  [1986].  Han  proposes  implementing  the  linear 
algebra  of  the  BFGS  on  parallel  computers  by  keeping  a  matrix  Zc  for  which 

ZjHcZc^I,  (5.4) 

and  then  updating  it  to  a  matrix  Z+  for  which 

ZjH^Z^  =  I  (5.5) 

by  making  a  rank  one  change  to  Zc  to  obtain  Z+ .  But  since  equations  (5.4)  and  (5.5)  are  equivalent  to 
He"'  -ZcZe^  and  =Z+Z+^,  this  is  simply  another  way  of  interpreting  the  lower  right  variant  in 
Table  5.2,  and  it  is  easy  to  confirm  that  the  calculations  in  Han’s  method  and  the  lower  right  variant  in 
Table  5.2  are  identical. 

These  four  possible  ways  of  implementing  the  linear  algebra  calculations  of  the  BFGS  method  pro¬ 
duce  identical  results  in  exact  arithmetic.  They  differ  in  the  number  of  arithmetic  operations  they  requite, 
in  their  suitability  to  parallel  computation,  and  perhaps  in  their  accuracy  when  implemented  in  computer 
arithmetic.  None  appears  to  be  best  in  all  these  regards.  There  has  long  been  a  belief  that  the  two  left 
hand  variants  in  Table  5.2  might  have  problems  in  finite  precision  arithmetic  because  they  might  produce 
numerically  non-positive  definite  Hessian  approximations.  Powell  [1987]  has  also  shown  that  the  two 
bottom  variants  may  have  difficulties  in  finite  precision  arithmetic  handling  very  badly  scaled  problems. 
The  combination  of  these  two  observations  thus  leads  one  to  favor  the  upper  right  variant  and  this  is  the 
conventional  wisdom. 

In  several  computational  tests  that  have  compared  the  use  of  these  four  variants  in  a  BFGS  method, 
however,  the  differences  between  them  have  turned  out  to  be  negligible,  even  on  somewhat  extreme  test 
problems.  These  include  tests  performed  by  Grandinetti  [1978],  Connolly  and  Nocedal  [1987],  and  tests 
we  recently  performed  using  all  four  variants  of  the  BFGS  shown  in  Table  5.2  in  the  UNCMIN  code.  In 
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our  tests,  on  no  problem  was  there  more  than  a  1*2%  variation  between  the  costs  of  solving  it  using  these 
four  different  implementations. 

Thus  we  consider  any  of  the  variants  in  Table  5.2  to  be  a  suitable  candidate  for  the  implementation 
of  the  BFGS  method  on  a  parallel  computer.  Among  these  four,  the  bottom  two  variants  clearly  are  the 
most  amenable  to  parallelization.  Both  require  only  matrix  vector  multiplications  and  rank  one  updates, 
computations  that  can  be  easily  and  efficiently  parallelized  over  a  large  variety  of  parallel  architectures 
and  problem  sizes.  The  upper  right  variant  appears  far  more  difficult  to  parallelize  efficiently  due  to  the 
need  to  perform  the  2n-2  Given’s  rotations,  on  vectors  of  size  2  through  n ,  sequentially.  The  upper  left 
variant  is  clearly  too  expensive  to  consider  since  some  of  the  much  less  expensive  sequential  variants 
parallelize  well.  The  difference  in  efficiency  between  the  two  bottom  variants  on  parallel  computers  can 
be  expected  to  be  about  a  factor  of  two  (but  only  4/3  on  a  distributed  memory  computer,  see  Byrd,  Schna¬ 
bel.  and  Shultz  [1988b]),  so  we  would  tend  to  favor  the  lower  left  variant,  but  experiments  using  the  two 
bottom  variants  on  parallel  computers  need  to  be  performed. 

Thus  the  discussion  of  this  section  has  shown  that  the  consideration  of  parallel  computation  may 
lead  to  a  different  choice  for  implementing  the  linear  algebraic  computations  in  the  BFGS  method  than  is 
generally  made  on  sequential  computers.  It  also  makes  it  clear  that,  to  conclusively  resolve  this  issue, 
optimization  researchers  need  to  carefully  consider  whether  and  when  the  inverse  updates,  faaored  or 
unfactored,  have  practical  computational  deficiencies. 

5,2  Utilizing  Concurrent  Function  Evaluations  for  Unconstrained  Optimization 

In  this  section  we  review  some  possibilities  for  utilizing  concurrent  function  evaluations  in  general 
purpose  unconstrained  optimization  algorithms.  These  possibilities  include  both  the  parallelization  of 
standard  optimization  algorithms,  and  new  algorithms  motivated  by  the  consideration  of  parallel  compu¬ 
tation.  Our  point  of  departure  is  the  standard  BFGS  method  described  in  Algorithm  5.1. 

Recall  first  the  pattern  of  function  and  gradient  evaluations  in  a  standard  secant  method  like  Algo¬ 
rithm  5.1.  Each  iteration  performs  one  or  more  trial  point  function  evaluations,  evaluations  of  /  (Xc+Xdc) 
for  some  values  of  the  step  length  parameter  X,  within  the  line  search.  The  last  of  these  is  at  the  success¬ 
ful  next  iterate  x+ ,  and  is  followed  by  the  gradient  evaluation  at  x+ .  Generally  this  is  the  only  gradient 
evaluation  in  the  iteration.  In  our  experience,  the  average  number  of  trial  point  function  evaluations  per 
iteration  over  the  course  of  the  algorithm  tends  to  be  between  1.2  and  1.5. 
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We  are  most  concerned  with  performing  these  function  and  derivative  evaluations  efficiently  on  a 
parallel  computer  if  they  are  expensive.  In  our  practical  experience,  when  /  (x)  is  expensive.  V/  (x)  usu¬ 
ally  is  calculated  by  the  finite  difference  approximation 


V/(x).= 


/(x+/i.«i)-/(x) 

- s; - 


r  =  1.  •  •  • ,  n 


(5.6) 


where  e,  is  the  unit  vector  and  hi  is  an  t^propriately  chosen  finite  difference  step  size.  This  requires 
n  additional  evaluations  of  /  (x).  Geaily  these  function  evaluations  can  be  performed  concurrently  on  a 
parallel  computer.  Thus  on  a  machine  with  p  processors,  the  finite  difference  gradient  evaluation 

concurrent  function  evaluation  steps,  steps  where  each  processor  performs  at  most  one 


reqmres 


evaluation  of /  (x )  concurrently. 


Thus  if  /  (x )  is  expensive  and  the  number  of  processors,  p ,  is  much  less  than  n .  then  simply  paral¬ 
lelizing  the  finite  difference  gradient  calculation  by  performing  groups  of  p  function  evaluations  con¬ 
currently  leads  to  very  good  speedups.  If  p  >=n ,  however,  the  overall  speedup  for  all  the  function  and 
gradient  evaluation  steps  will  be  no  better  than  about  half  of  optimal.  This  is  because  all  but  one  of  the 
processors  will  be  unused  during  the  trial  point  function  evaluations,  and  there  will  be  at  least  as  many 
steps  devoted  to  trial  point  function  evaluations  as  to  gradient  evaluations. 

This  leads  to  the  obvious  question,  "How  can  we  utilize  additional  processors  while  evaluating 
/  (xe  +  A,  dc )  on  one  processor  in  the  line  search?"  There  are  two  obvious  possibilities.  The  most  cotrunon 
suggestion  (see  for  example  Dixon  and  Patel  [1982],  or  Lootsma  [1984])  is  to  supplement  the  line  search 
by  evaluating  /(x)  at  p-1  other  points  simultaneously.  We  refer  to  this  strategy  as  a  multiple  point 
search. 

A  second  possibility,  originally  suggested  by  Schnabel  [1987],  is  to  use  the  remaining  p-1  proces¬ 
sors  to  evaluate  (part  of)  V/  (xc  +  A,  dc )  by  finite  differences  b^ore  it  is  known  whether  this  gradient  value 
will  be  needed.  We  refer  to  this  as  a  speculative  (partial)  gradient  evaluation.  If  Xc+Xde  is  accepted  as 
the  next  iterate,  as  it  usually  will  be.  then  the  function  evaluations  that  have  been  perfomied  in  the  specu¬ 
lative  gradient  evaluation  have  all  turned  out  to  be  necessary  ones.  If  Xc+Xdc  is  not  accepted  as  the  next 
iterate,  then  the  speculative  gradient  evaluation  has  been  unnecessary,  although  Byrd,  Schnabel,  and 
Shultz  [1988b]  describe  a  new  method  that  makes  some  beneficial  use  of  this  information. 


We  believe  that  using  the  extra  processors  that  are  available,  while  evaluating  /  (xc  Jc)  in  the 
line  search,  to  perform  a  speculative  finite  difference  gradient  evaluation  will  usually  result  in  a  more 
efficient  parallel  algorithm  than  utilizing  the  extra  processors  to  perform  a  multiple  point  search.  Indeed, 
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in  Older  for  the  multiple  point  search  to  be  the  superior  strategy,  it  would  need  to  reduce  the  overall 
number  of  iterations  by  a  factor  of  almost  (assuming  p^-t-1).  In  this  case,  the  multiple  point 

search  would  lead  to  a  better  sequential  algorithm  as  well.  We  consider  this  unlikely,  especially  since  it  is 
usually  very  hard  to  improve  upon  the  choice  1  close  to  the  solution.  We  cannot  make  a  more  definite 
assessment  because,  to  our  knowledge,  proposers  of  multiple  point  searches  have  not  provided  the  data 
that  would  allow  us  to  compare  their  strategies  to  speculative  gradient  evaluations. 

If  p^+1  and  gradients  are  evaluated  by  finite  differences,  then  we  feel  there  is  little  more  to  say 
about  utilizing  concurrent  function  evaluations  for  unconstrained  optimization.  The  remaining  interesting 
cases  are  when  p  >/i+l,  or  when  /  (a;)  and  V/  (x)  are  naturally  evaluated  together  utilizing  one  processor. 
In  either  of  these  cases,  additional  processors  are  available  while  /(x)  and  V/  (x)  are  being  evaluated. 
Byrd.  Schnabel,  and  Shultz  [1988a,  b]  consider  several  ways  to  utilize  these  additional  processors.  In  the 
remainder  of  this  :ection  we  briefly  summarize  aome  of  their  suggestions. 

If  there  are  enough  processors  so  that  the  function,  gradient,  and  finite  difference  Hessian  can  all  be 
evaluated  at  once,  then  the  analogous  idea  to  that  discussed  above  is  to  perform  speculative  finite  differ¬ 
ence  Hessian  evaluations.  (Finite  difference  approximation  of  the  Hessian  requires  additional 

evaluations  of  / (x)  or  n  additional  evaluations  of  Vf  (x).)  This  means  evaluating  all  of  (Xe+Xdc ), 

as  well  as  V/  (x^  +Xdc).  concurrently  with  the  trial  point  function  evaluation / (x^  +Xde).  If  Zc  +X.de  is 
accepted  as  the  new  iterate  x^. .  as  it  usually  will  be.  all  the  speculative  evaluations  perform  useful  work. 
Note  that  this  is  basically  just  a  way  to  parallelize  a  standard  second  derivative  method,  although  new 
algorithmic  features  could  be  introduced  to  attempt  to  the  speculative  gradient  arxl  Hessian  information  at 
unsuccessful  trial  points. 

If  there  are  more  processors  than  necessary  to  evaluate  /(z)  and  V/  (z)  simultaneously,  but  not 
enough  to  evaluate  the  entire  finite  difference  Hessian  simultaneously  as  well,  then  Byrd,  Schnabel,  and 
Shultz  [1988a,  b]  propose  utilizing  the  remaining  processors  to  perform  speculative  evaluation  of  part  of 
the  finite  difference  Hessian  at  each  iteration.  This  leads  to  the  consideration  of  new  optimization  algo¬ 
rithms.  Now  we  briefly  review  this  work 

Byrd,  Schnabel,  and  Shultz  [1988a]  consider  many  alternatives  for  evaluating  part  of  the  Hessian  at 
each  iteration,  and  for  incorporating  this  information  into  an  optimization  algorithm.  The  strategy  for 
evaluating  part  of  the  Hessian  which  they  find  to  be  most  successful  is  simply  to  evaluate  some  set  of 
columns  of  the  Hessian  at  each  iteration,  with  these  sets  sweeping  througli  all  the  columns  as  the  itera¬ 
tions  proceed.  At  any  given  iteration,  this  means  one  evaluates  r,  =VV(z)  c,-  for  some  values 
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i  €  ledlji  ]•  The  strategy  for  incorporating  this  information  that  Byrd,  Schnabel,  and  Shultz  [1988a,  b] 
find  best  is  to  first  update  //  to  //  by  the  standard  BFGS  update  (2.4),  utilizing  the  normal  secant  equation 
(2.3).  Then  they  update  H  to  H+by  performing  a  multiple  BFGS  update  (Schnabel  [1983]),  which  causes 
H^.to  satisfy  H^ei- z,  for  each  i  e  4 .  The  rationale  for  inserting  the  information  in  this  order  is  that  the 
normal  BFGS  update  gives  information  about  the  Hessian  between  and  x^. ,  while  the  finite  difference 
Hessian  information  gives  values  at  x+ .  Thus  the  finite  difference  information  is  inserted  last 

Byrd,  Schnabel,  and  Schultz  [1988b]  show  that  an  algorithm  that  utilizes  the  above  strategy  for 
incorporating  part  of  the  finite  difference  Hessian  at  each  iteration  retains  the  superlinear  convergence 
rate  of  the  BFGS  method.  Of  course  the  intent  is  that  the  new  method  should  perform  better  than  the 
BFGS  in  practice,  since  it  uses  more  information,  but  this  is  about  the  best  theoretical  result  one  can 
expect. 

The  results  of  extensive  computational  tests  utilizing  the  partial  Hessian  methods  described  above 
are  reported  in  Byrd,  Schnabel,  and  Shultz  [1988a,  b].  Table  5.7  summarizes  their  results  on  a  set  of  test 
problems  with  n=20.  The  second  row  shows  the  speedup  of  the  new  method,  that  evaluates  the  function, 
gradieiu,  and  q  columns  of  the  finite  difference  Hessian  at  each  trial  point,  over  a  parallel  BFGS  method 
that  just  evaluates  the  function  and  gradient  at  each  trial  point,  under  the  assumption  that  there  are  enough 
processors  to  evaluate  /(x),  V/  (x),  and  q  columns  of  the  finite  difference  Hessian  concurrently.  The 
third  row  shows  the  speedups  of  the  same  partial  Hessian  methods,  under  the  same  assumptions  about  the 


Table  5.7  -  Average  Speedup  of  a  Method  Using  Speculative  Partial  Hessian  Evaluations 
on  a  Test  Set  with  n=2Q  (speed  measured  in  function  evaluations) 

(from  Byrd,  Schnabel,  and  Shultz  [1988b]) 


Number  of  columns  of  Hessian 
calculated  at  each  iteration 

0 

1 

2 

3 

■1 

5 

20* 

Speedup  over  Parallel  BFGS 
utilizing  speculative  gradient  evaluation 

m 

1.8 

111 

2.9 

3.0 

3.2 

6.0 

Speedup  over  Sequential  BFGS 
with  finite  difference  gradients 

17.5 

31.5 

40.3 

50.8 

52.5 

56.0 

105 

Number  of  processors  required  to 
evaluate  /  (x ),  V/  (x ),  V^/  (x ) 
concurrently  from  function  values 

21 

42 

62 

81 

99 

116 

231 

♦Newton’s  method 
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number  of  processors,  in  comparison  to  a  standard  sequendal  BFGS  method  that  performs  finite  differ¬ 
ence  gradient  evaluations  and  utilizes  only  one  processor.  These  numbers  in  the  third  row  are  17.5  times 
as  large  as  the  numbers  in  the  second  row.  This  reflects  the  faa  that  the  parallel  BFGS  is  17.5  times  as 
fast  as  the  sequential  BFGS  on  these  problems  on  the  average,  or  equivalently,  that  there  are  an  average 
of  1.21  trial  point  function  evaluations  per  iteration  on  these  problems.  The  final  row  shows  the  number 
of  processors  that  would  be  necessary  to  implement  the  parallel  partial  Hessian  algorithms  for  each  value 
of  q ,  when  the  finite  difference  gradient  and  Hessian  are  computed  from  function  values. 

The  results  in  Table  5.7  show  that  the  new  method  that  utilizes  otherwise  idle  processors  to  perform 
spectilative  partial  Hessian  evaluations  is  more  efficient,  in  terms  of  concurrent  function  evaluation  steps, 
than  a  standard  method  that  doesn't  utilize  these  processors,  but  that  these  improvements  are  not  propor¬ 
tional  to  the  amount  of  new  information  that  is  being  utilized.  This  is  to  be  expected,  however,  because  as 

the  table  shows,  Newton’s  method,  which  uses  times  as  much  information  as  the  BFGS  method,  is 

only  about  6  times  as  fast  on  the  average  on  these  problems.  Combining  the  last  two  rows  of  Table  5.7 
shows  that  the  speedups  are  at  least  0.45  of  the  optimal  in  all  cases,  which  is  considered  reasonable  in 
parallel  computation.  These  results  also  indicate,  however,  that  there  is  still  an  opportunity  to  find  new 
algorithms  that  make  better  use  of  partial  Hessian  information. 


6.  Concluding  Remarks 

Recent  research  in  unconstrained  optimization  has  demonstrated  that  even  though  the  field  has 
reached  a  faiiiy  mature  state,  many  interesting  and  potentially  fruitful  research  possibilities  still  exist  The 
research  on  the  BFGS,  DFP,  and  their  convex  combinations  show  that  there  are  still  fundamental  theoreti¬ 
cal  issues  concerning  secant  updates  that  need  to  be  better  understood.  The  research  on  the  SRI  and  on 
updates  beyond  the  BFGS  (<t><l)  illustrates  that  there  may  be  updates  that  perform  better  than  the  BFGS 
in  practice,  but  that  we  need  to  understand  these  updates  bener  in  both  practice  and  theory.  It  appears 
increasingly  unlikely,  however,  that  any  new  secant  update  will  result  in  a  large  improvement,  s  y  greater 
than  25%,  over  the  BFGS. 

The  research  on  derivative  tensor  methods  for  nonlinear  equations  and  unconstrained  optimization 
has  led  to  surprisingly  large  improvements  in  efficiency  over  standard  methods,  often  in  the  range  of  30- 
50%.  Since  these  methods  constitute  just  one  of  many  possibilities  for  incorporating  additional  function 
or  derivative  information  into  optimization  algorithms  by  using  nonstandard  models,  this  research  indi¬ 
cates  that  there  may  be  other  interesting  unexplored  possibilities  for  utilizing  nonstandard  models  in 
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unconstrained  optimization  algorithms. 

Research  in  parallel  optimization  methods  is  still  in  its  infancy.  Much  of  the  research  described  in 

Section  5  can  be  considered  to  be  fairly  straightforward  adaptations  or  generalizations  of  standard 

methods.  We  consider  it  likely  that  more  novel  parallel  optimization  research  will  result  from  consider^ 

ing  parallel  methods  for  specific  classes  of  large  scale  optimization  problems.  Indeed,  the  consideration 

of  specific  classes  of  large  scale  optimization  problems,  which  has  been  neglected  in  this  paper,  probably 
holds  many  of  the  future  challenges  for  research  in  unconstrained  optimization. 
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